Abstract: Microarray cancer gene expression data are very high
dimensional. Reducing the dimensionality helps improve the overall
analysis and classification performance. We propose two hybrid
techniques, Biogeography-based Optimization - Random Forests (BBO-RF)
and BBO-SVM (Support Vector Machines), with gene ranking as a
heuristic, for microarray gene expression analysis. This heuristic is
obtained from an information gain filter ranking procedure. The BBO
algorithm generates a population of candidate gene subsets, as part
of an ecosystem of habitats, and employs the migration and mutation
processes across multiple generations of the population to improve
the classification accuracy. The fitness of each gene subset is
assessed by the classifiers, SVM and Random Forests. The performance
of these hybrid techniques is evaluated on three cancer gene
expression datasets retrieved from the Kent Ridge Biomedical datasets
collection and the libSVM data repository. Our results demonstrate
that genes selected by the proposed techniques yield classification
accuracies comparable to previously reported algorithms.

I. Introduction 

Microarray gene expression experiments measure the expression levels
of thousands of genes simultaneously. Such data help in diagnosing
various types of tumors with better accuracy. The major limitation of
this process is that it generates a large amount of complex data.
Normally, the number of genes (features) is much greater than the
number of samples (instances) in a microarray gene expression
dataset. Such structures pose problems for machine learning and make
classification difficult, mainly because most of the thousands of
genes do not contribute to the classification process. As a result,
gene subset selection becomes extremely important for constructing
efficient classifiers with high predictive accuracy.

One way to overcome this problem is to select a small subset of
informative genes from the data. This technique, known as gene
selection or feature selection, helps in tackling overfitting by
getting rid of noisy genes, in reducing the computational load and in
increasing the overall classification performance of the learning
models.



Gene selection algorithms are mainly categorized as wrappers and
filters. Wrappers make use of learning algorithms to estimate the
quality or suitability of genes for the modelling problem.
Optimization algorithms in combination with various classifiers fall
into this category, as described in [1], [2], [3]. Filters [4], on
the other hand, evaluate genes by their inherent characteristics
without making use of a learning algorithm. Filters therefore give an
insight into the properties of the dataset in use. Algorithms based
on statistical tests and mutual information are some examples of
filters.

This paper presents hybrid BBO-RF and hybrid BBO-SVM approaches for
simultaneous informative gene selection and high performance
classification. Additionally, to enhance performance, we provide
information gain gene ranking as heuristic knowledge to our BBO
algorithm. The algorithm traverses the enormously large search space,
using this ranking information to iteratively obtain informative gene
subsets. The selected subsets of genes (candidate solutions) in each
generation are subsequently evaluated by SVM and Random Forests CV
(cross validation) accuracies.

II. Methodology 

A. Biogeography-based Optimization 

Biogeography is the study of the geographical distribution of species
over geological periods of time. The biological literature on the
subject is massive. In 2008, Simon [5] applied the biogeography
analogy to engineering optimization for the first time and thus
introduced the Biogeography-based Optimization (BBO) technique. It is
a population based method that works with a collection of candidate
solutions over generations. It attempts to explore combinatorially
large solution spaces with a stochastic approach, like many other
evolutionary algorithms [6], [7]. It mimics the geographical
distribution of species to represent the problem and its candidate
solutions in the search space, subsequently using the processes of
species migration and mutation to redistribute solution instances
across the search space in quest of globally optimal or near optimal
solutions.



BBO, as is or in variations, has been explored for various
combinatorial and constrained/unconstrained optimization problems
[8], including the Traveling Salesman Problem [9], [10], satellite
image classification [11] and sensor selection [5], among others.
However, as of 2012, no work has been reported on using BBO as a gene
selection technique for microarray gene expression data analysis. In
this work, we study BBO for gene selection and classification.

In BBO, there exists an ecosystem (population) which in 
turn consists of a number of habitats (islands). Each habitat has 
a habitat suitability index (HSI), which is similar to a fitness 
function and depends on many features/attributes of the island. 
If a value is assigned to each feature, then the HSI of a habitat 
H is a function of these values. These variables characterizing 
a habitat's suitability collectively form the 'suitability index 
variables' (SIVs). Thus, 

\[ \mathrm{HSI}(\mathrm{Habitat}_i) = f(\mathrm{SIV}_1, \mathrm{SIV}_2, \ldots, \mathrm{SIV}_m) \]

For the problem of gene selection, the SIVs of a habitat 
(candidate solution) are the selected subsets of genes out of 
the set of all genes. The ecosystem is therefore a random 
collection of candidate gene subsets. 
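
The representation described above can be illustrated with a minimal
Python sketch. The function name `init_ecosystem` and its parameters
are hypothetical and chosen only for illustration; genes are assumed
to be identified by the integers 1 to num_genes, and each habitat is
assumed to hold m distinct gene numbers.

```python
import random


def init_ecosystem(num_genes, n_habitats, m, rng):
    """Create n_habitats random habitats, each a list of m distinct gene numbers.

    Assumption: genes are numbered 1..num_genes; rng is a random.Random instance.
    """
    gene_ids = range(1, num_genes + 1)
    # random.sample draws without replacement, so no gene repeats in a habitat
    return [rng.sample(gene_ids, m) for _ in range(n_habitats)]


ecosystem = init_ecosystem(num_genes=500, n_habitats=50, m=5, rng=random.Random(1))
```

Each inner list is one candidate gene subset; the HSI of a habitat is
then whatever the chosen classifier reports for that subset.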

A good solution is thus analogous to a habitat with a high HSI, and
vice versa. High HSI solutions tend to share SIVs with low HSI
solutions. This form of sharing, termed migration, is controlled by
the emigration and immigration rates of the habitats. We have
purposely kept the model simple and have followed the original simple
linear model for migration, as shown in Figure 1,

Fig. 1. Migration rate vs. No. of species

where E and I are the maximum emigration and immigration rates, both
typically set to 1. The individual immigration and emigration rates
(λ and μ respectively) are calculated by the same formulae for this
simple linear model as in [5].

\[ \lambda_k = I \left( 1 - \frac{k}{n} \right), \qquad \mu_k = \frac{E \, k}{n} \]

where k is the iterator for the n habitats.
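
The linear migration model can be sketched in Python as follows. This
is an illustration under a stated assumption, not the original
implementation: habitats are ranked by HSI and the rank k (1 for the
worst habitat, n for the best) takes the place of the species count.
The function name `linear_migration_rates` is hypothetical.

```python
def linear_migration_rates(hsi_values, I=1.0, E=1.0):
    """Return (immigration, emigration) rate lists for the linear model.

    Assumption: rank k by HSI (1 = worst, n = best) stands in for the
    number of species, giving lambda_k = I * (1 - k/n) and mu_k = E * k/n.
    """
    n = len(hsi_values)
    # habitat indices ordered from lowest to highest HSI
    order = sorted(range(n), key=lambda idx: hsi_values[idx])
    lam = [0.0] * n
    mu = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        lam[idx] = I * (1.0 - rank / n)
        mu[idx] = E * rank / n
    return lam, mu


lam, mu = linear_migration_rates([0.9, 0.5, 0.7])
```

With E = I = 1, the two rates of every habitat sum to 1, so the best
habitat only emigrates and never immigrates.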

B. BBO Gene Selection Algorithm 

We present our BBO algorithm for performing gene selection. For our
problem, we treat a gene (identified by its gene number) as an SIV
for a habitat, and each habitat has m SIVs (arity m, as Habitat
H ∈ SIV^m). For example, if a habitat is
Habitat_i = {12, 345, 26, 7, 141}, then the SIVs 12, 345, ..., 141
are the selected gene numbers out of, say, a collection of 500 genes,
and the subset size is 5 genes. The tuned parameter values for our
algorithm are given in the next section. The BBO gene selection
algorithm is stated in Algorithm 1.

Algorithm 1 BBO for Gene Selection
1: Initialize BBO parameters
2: Initialize ecosystem with n randomly generated habitats
3: Evaluate habitats: calculate the HSI of each habitat in the
   ecosystem (cross validation accuracy of each gene subset from a
   classifier)
4: for G generations do
5:    Compute λ_i and μ_i for each Habitat_i based on its HSI
6:    Perform migration
7:    Perform mutation
8:    Re-evaluate ecosystem
9:    Perform elitism
10: end for
11: Output the habitat with best HSI and its SIVs (selected genes).
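
The control flow of Algorithm 1 can be sketched as a short Python
driver. This is a hedged illustration only: `run_bbo`, `n_elite` and
the callables `evaluate`, `migrate` and `mutate` are hypothetical
names, the migration callable is assumed to compute the migration
rates from the HSI values it receives, and elitism is assumed to
reinsert lost elites in place of the worst habitats.

```python
def run_bbo(ecosystem, evaluate, migrate, mutate, generations, n_elite=2):
    """Generic BBO loop; the operators are supplied by the caller.

    evaluate(habitat) -> HSI, migrate(eco, hsi) -> eco, mutate(eco) -> eco.
    """
    eco = [list(h) for h in ecosystem]
    hsi = [evaluate(h) for h in eco]
    for _ in range(generations):
        # remember the current best habitats before they can be modified
        ranked = sorted(zip(hsi, eco), key=lambda pair: pair[0], reverse=True)
        elites = [(score, list(h)) for score, h in ranked[:n_elite]]
        eco = mutate(migrate(eco, hsi))
        hsi = [evaluate(h) for h in eco]
        # elitism: put lost elites back in place of the worst habitats
        worst_first = sorted(range(len(eco)), key=lambda i: hsi[i])
        for (score, h), i in zip(elites, worst_first):
            if score > hsi[i] and h not in eco:
                eco[i], hsi[i] = h, score
    best = max(range(len(eco)), key=lambda i: hsi[i])
    return eco[best], hsi[best]
```

Keeping the operators as parameters lets the same loop be used with
either classifier as the evaluator and with or without the heuristic
mutation.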

At each migration or mutation, we ensure that a gene is not
duplicated within a single subset of genes, i.e. within a single
habitat. The subset size (the variable m) can be set for each run of
the BBO algorithm. All habitats in the ecosystem then have the same
number of SIVs. The subset size is decided in advance and can be
tuned manually by running BBO for various subset sizes.

Migration: The migration procedure of the original BBO algorithm [5]
is retained. We reproduce the algorithm here for the purpose of
completeness.

Algorithm 2 Migration
Select H_i with probability ∝ λ_i
if H_i is selected then
   for j = 1 to n do
      Select H_j with probability ∝ μ_j
      if H_j is selected then
         Randomly select an SIV σ from H_j
         Replace a random SIV in H_i with σ
      end if
   end for
end if
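
A minimal Python sketch of the migration step, applied to every
habitat of the ecosystem, is given below. It rests on assumptions that
are not stated in the text: the rates are used directly as selection
probabilities (they lie in [0, 1] when E = I = 1), and an immigrating
gene is drawn only from genes not already present in the receiving
habitat, which is one way to avoid duplicates. The name `migrate` is
hypothetical.

```python
import random


def migrate(ecosystem, lam, mu, rng=random):
    """Return a new ecosystem after one migration sweep (input is not modified)."""
    new_eco = [list(h) for h in ecosystem]
    n = len(ecosystem)
    for i in range(n):
        if rng.random() >= lam[i]:
            continue  # habitat i is not selected for immigration
        for j in range(n):
            if j == i or rng.random() >= mu[j]:
                continue  # habitat j is not selected for emigration
            # only genes absent from habitat i may immigrate (no duplicates)
            candidates = [g for g in ecosystem[j] if g not in new_eco[i]]
            if not candidates:
                continue
            position = rng.randrange(len(new_eco[i]))
            new_eco[i][position] = rng.choice(candidates)
    return new_eco


habitats = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
migrated = migrate(habitats, lam=[1.0, 0.5, 0.0], mu=[0.0, 0.5, 1.0],
                   rng=random.Random(42))
```

In this call the third habitat has an immigration rate of 0 and is
therefore left unchanged, while the first always receives at least
one gene from it.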



Mutation with Information Gain based Gene Ranking: Given the vast
search space formed by the possible genes, we keep the mutation rate
at about 0.4 to 0.55 so that other portions of this space have a good
chance of being explored. We use information gain heuristics as
additional information during mutation. The information gain (IG)
[12] of a gene is an attribute selection measure. It captures the
'information content' of a gene with respect to the problem under
consideration.



The IG of a gene indicates its capability to separate instances for
binary classification. We are especially interested in the non-zero
IG values. Thus we partition the informative and non-informative
genes into separate sets. For IG computation,
we have used Weka ITTl data mining software suite, which 
outputs the information gain based ranking of genes. This ranking is
fed to BBO for further computations. The informative gene set, i.e.
the genes with non-zero infogain values, is introduced into the
population during the process of mutation.
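
To make the notion of information gain concrete, the following Python
sketch computes the IG of one gene for a single threshold split of its
expression values. It is an illustration under assumptions: the
threshold is supplied by the caller, and the discretization used by
the data mining suite may differ. The function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(values, labels, threshold):
    """IG of splitting the samples on expression value <= threshold."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    # expected entropy of the class labels after the split
    conditional = sum(len(part) / n * entropy(part)
                      for part in (left, right) if part)
    return entropy(labels) - conditional


ig_good = information_gain([1, 2, 8, 9], ["A", "A", "B", "B"], threshold=5)
ig_zero = information_gain([1, 9, 1, 9], ["A", "A", "B", "B"], threshold=5)
```

Here the first gene separates the two classes perfectly and receives
an IG of 1 bit, whereas the second gene carries no class information
and receives an IG of 0, so it would fall into the non-informative
set.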

While in mutation, our algorithm either randomly explores newer genes
or exploits the available set of genes with non-zero infogain values.
We set a user-defined exploitation probability q_0. This exploitation
is done in a probabilistic
manner as analogous with the exploration and exploitation in 
ant colony optimization 03), iTBl . EQ, ID, El- To give 
an example, suppose we have a total of 30 expressed genes and only
the first 8 out of these 30 genes have a non-zero infogain. If
rand < q_0 is satisfied (step 5 of Algorithm 3), then we select one
of these 8 genes, with a probability proportional to its information
gain ranking score, to be newly put in Habitat_i in place of an
existing one. If, on the other hand, rand < q_0 is not satisfied, we
execute the else part (steps 7 and 8), i.e. randomly select a gene
from all of the 30 genes expressed in the data.

Algorithm 3 gives the detailed mutation algorithm. 



Algorithm 3 Mutation
1: for j = 1 to m do
2:    Use λ_i, μ_i of habitat H_i to compute the probability P_i
3:    Select SIV (gene) H_i(j) with probability ∝ P_i
4:    if H_i(j) is selected then
5:       if rand < q_0 then
6:          Exploit: Replace H_i(j) with a probabilistically selected
            SIV (gene) from the rest (using their information gain)
7:       else
8:          Explore: Replace H_i(j) with a random SIV (gene) out of
            the rest
9:       end if
10:   end if
11: end for
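
The exploit/explore choice in the mutation step can be sketched in
Python as below. This is a simplified illustration: a fixed
per-position mutation probability `p_mutate` is assumed in place of
the habitat-specific probability computed from the migration rates,
and `ig_scores` is assumed to map each informative gene to its
non-zero information gain. All names are hypothetical.

```python
import random


def mutate_habitat(habitat, all_genes, ig_scores, p_mutate, q0, rng=random):
    """Return a mutated copy of one habitat (a list of distinct gene numbers)."""
    new_h = list(habitat)
    for pos in range(len(new_h)):
        if rng.random() >= p_mutate:
            continue  # this SIV is not selected for mutation
        present = set(new_h)
        informative = [g for g in ig_scores
                       if g not in present and ig_scores[g] > 0]
        if informative and rng.random() < q0:
            # exploit: roulette-wheel choice weighted by information gain
            weights = [ig_scores[g] for g in informative]
            new_gene = rng.choices(informative, weights=weights, k=1)[0]
        else:
            # explore: uniform choice among all genes not yet in the habitat
            pool = [g for g in all_genes if g not in present]
            if not pool:
                continue
            new_gene = rng.choice(pool)
        new_h[pos] = new_gene
    return new_h


# 30 genes, of which only the first 8 have a non-zero information gain
scores = {g: 1.0 / g for g in range(1, 9)}
mutated = mutate_habitat([10, 11, 12, 13, 14], range(1, 31), scores,
                         p_mutate=1.0, q0=1.0, rng=random.Random(0))
```

With both probabilities set to 1, every position is replaced by one of
the 8 informative genes; lowering q0 gives the remaining genes a
chance of inclusion.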



Elitism: We implement elitism so that the best solutions 
obtained until a particular generation do not get corrupted. 

C. Support Vector Machines 

Support Vector Machines (SVMs) [18] were originally introduced by
Vapnik and co-workers [19] and successively extended by a number of
other researchers. SVM employs a maximum margin linear hyperplane for
solving binary linear classification problems. For non-linearly
separable problems, SVM first transforms the data into a higher
dimensional feature space and subsequently employs a linear
hyperplane. To deal with computational intractability issues, it
further uses appropriate kernel functions that facilitate all
computations in the input space itself. Vapnik et al. in [20] have
themselves used SVM with recursive feature elimination (RFE) for gene
selection and achieved notably high accuracy levels. We discuss the
results further in the subsequent section.

For our purposes we employ the libSVM [21] library for 
evaluation of our candidate solutions during each generation. 
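
How a cross validation accuracy can serve as the HSI of a gene subset
is sketched below in Python. The sketch deliberately uses a simple
nearest-centroid classifier as a stand-in, not an SVM, so that it
stays self-contained; the interleaved fold assignment, the 0-based
gene indices and all function names are assumptions made for
illustration.

```python
def fit_centroids(X_train, y_train):
    """Stand-in classifier: store the mean expression vector of each class."""
    sums, counts = {}, {}
    for row, label in zip(X_train, y_train):
        acc = sums.setdefault(label, [0.0] * len(row))
        for d, value in enumerate(row):
            acc[d] += value
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}


def predict_centroid(model, row):
    """Predict the class whose centroid is closest in squared distance."""
    return min(model, key=lambda lab: sum((a - b) ** 2
                                          for a, b in zip(row, model[lab])))


def cv_accuracy_hsi(X, y, gene_subset, k=10,
                    fit=fit_centroids, predict=predict_centroid):
    """k-fold CV accuracy (in %) using only the genes in gene_subset."""
    n = len(y)
    correct = 0
    for fold in range(k):
        test_idx = [i for i in range(n) if i % k == fold]
        train_idx = [i for i in range(n) if i % k != fold]
        if not test_idx or not train_idx:
            continue
        model = fit([[X[i][g] for g in gene_subset] for i in train_idx],
                    [y[i] for i in train_idx])
        for i in test_idx:
            if predict(model, [X[i][g] for g in gene_subset]) == y[i]:
                correct += 1
    return 100.0 * correct / n
```

Any classifier exposing comparable fit and predict callables could be
passed in place of the stand-in.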

D. Random Forests 

Random Forests (RF) were first introduced by Breiman and Cutler [22].
An RF is an ensemble of randomly constructed independent decision
trees. It performs substantially better than single-tree classifiers
such as CART [23] and C4.5 [24]. A random subset of attributes is
used for node splitting while growing each decision tree. Normally,
for each tree, a bootstrap set (with replacement) is drawn from the
original training data, i.e. an instance is picked from the training
data and is replaced before drawing the next instance. Likewise, n
such instances are taken to form the 'in bag' set for a particular
tree. For each of the bootstrap training sets, about one-third of the
samples, on average, are left out of the 'in bag' data and are called
the 'out of bag' (OOB) data for that particular tree. The
classification tree is built with this 'in bag' data using the CART
algorithm [23]. Separate test data is not required in RF for checking
the overall accuracy of the forest. The OOB data is used for cross
validation. When all the trees are grown, the k-th tree classifies
the samples that are OOB for that tree (left out by the k-th tree).
In this manner, each instance is classified by about one third of the
trees. A majority vote is then taken to decide on the class label for
each case. The percentage of times that the voted class label is not
equal to the original class of a sample, averaged over all the cases
in the training data, is called the OOB error rate [25].
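
The 'about one-third' figure follows from bootstrap sampling: the
probability that a given instance is never picked in n draws with
replacement is (1 - 1/n)^n, which approaches 1/e, i.e. roughly 0.368,
for large n. The following Python sketch, with the hypothetical name
`bootstrap_split`, illustrates the in-bag and OOB sets of one tree.

```python
import random


def bootstrap_split(n, rng):
    """Draw n instance indices with replacement; return (in_bag, oob)."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    chosen = set(in_bag)
    # instances never drawn form the out-of-bag set for this tree
    oob = [i for i in range(n) if i not in chosen]
    return in_bag, oob


in_bag, oob = bootstrap_split(1000, random.Random(0))
oob_fraction = len(oob) / 1000  # expected to be close to 1/e
```

Repeating the split once per tree gives every instance a set of trees
for which it is OOB, and these are the trees that vote on it.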

We have used the randomForest package in R for implementation
purposes [26].

III. Discussion and Results 

A. Datasets 

The outputs of microarray experiments are the expression levels of
different genes. Three such datasets were obtained from the Kent
Ridge Biomedical datasets repository [27] and the libSVM repository
[21] (made available from various other original sources).

The Colon Cancer dataset, retrieved from the Kent Ridge Biomedical
dataset repository, consists of 62 instances representing cell
samples taken from colon cancer patients. Among these, 40 are tumor
samples while the remaining 22 are not [28]. The breast cancer
dataset is retrieved from the DUKE Breast Cancer SPORE frozen tissue
bank [29]. Of the 44 samples we worked with, each with expressions
for 7129 genes, 22 belong to class A (estrogen receptor-positive,
ER+) while 22 belong to class B (estrogen receptor-negative, ER-).
The Leukemia dataset [30], also retrieved from the Kent Ridge
Biomedical dataset repository, contains the expression of 7129 genes.
There are 72 samples in total, taken from leukemia patients, of which
25 belong to the Acute Myeloid Leukemia (AML) class and 47 belong to
the Acute Lymphoblastic Leukemia (ALL) class. These specifications
are tabulated in Table I.



TABLE I
Dataset Specifications

Cancer dataset name (D)   #genes   #classes   #instances (#A & #B)
Colon (C)                 2000     2          62 (40 & 22)
Breast (B)                7129     2          44 (22 & 22)
Leukemia (L)              7129     2          72 (25 & 47)



B. Discussion and Results 

As discussed earlier while describing our algorithms, we have
implemented BBO with and without heuristics. Interestingly, both
simple BBO and BBO with heuristics succeed in selecting a good set of
features that provides comparable classification results. Compared to
typical implementations of gene selection using genetic algorithms
and other EAs like PSO [6], [7], which usually work over many
generations with large population sizes [31], our implementations of
both versions of BBO (with SVM and with RF) for gene selection
started showing better results very early, with just 40-50 habitats
in the ecosystem. We performed 50 simulations each, with the number
of generations varying from 15 to 40. The algorithms almost always
converged to comparable results by the end of 25 generations, with
only minute differences in further generations.




Fig. 2. Example convergence of population averages of simple vs. heuristic
BBO over 25 generations (x-axis: generation no.); also with example
non-monotonic behavior (boxed) of average suitability of habitats with
simple BBO

For BBO with heuristics, we decided not to alter the original BBO
procedure much and incorporated the information gain heuristic of
each gene only during the mutation process, and even then only to
some degree and not for every mutation. This was done by borrowing
the analogy of exploration and exploitation from the ant colony
optimization
El, lfT5l . Ifl6l method to give the vast number of other 
genes a fair chance of inclusion; this degree of exploitation is
controlled by the user depending upon the problem at hand. As a
result, we observed both BBO-SVM and BBO-RF to converge to an optimal
or near optimal solution faster than their counterparts without
heuristics. Also, the average suitability of habitats in the
ecosystem almost always shows a monotonic improvement in this case,
unlike with simple BBO. In effect, the overall ecosystem (population)
shows improvement. The inclusion of probabilistic selection of genes
based on entropy, during mutation, adds to the improvement of the
overall results. This behavior was consistently observed over 50
simulations. Figure 2 shows an example of how the population average
in BBO with information gain heuristics converged to higher accuracy
in fewer generations as compared to simple BBO. The boxed portion
demonstrates the typically observed non-monotonic behavior during
some runs of simple BBO as against heuristic BBO.

With regard to the classifiers, SVM and RF, that we have used to
evaluate BBO selected gene subsets, earlier work in the literature
has reported SVM to outperform RF [32], both on average and on the
majority of microarray datasets, and our results reiterate the same.
We have verified this observation by training SVM and RF on the same
subset of selected genes. The evaluation of each habitat to obtain
its HSI from the classifier (the CV accuracies) can be run in
parallel, which speeds up the whole process.
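
Since habitats are evaluated independently of one another, the
parallel evaluation can be sketched with the Python standard library
as follows. The function name and the worker count are hypothetical.
A thread pool is assumed here, which helps when the evaluator spends
its time in an external library or process; for a pure-Python
evaluator a process pool would be the more natural choice.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_ecosystem_parallel(ecosystem, evaluate, max_workers=4):
    """Return the HSI of every habitat, evaluated concurrently.

    The result order matches the habitat order because Executor.map
    yields results in the order of its inputs.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(evaluate, ecosystem))


# toy evaluator standing in for a cross validation run of a classifier
hsi_values = evaluate_ecosystem_parallel([[1, 2], [3, 4]], evaluate=sum)
```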

From the literature, SVMRFE-RG [20] and Fisher-RG-SVMRFE [33] report
high accuracies with SVM for classification. However, SVMRFE-RG is
unable to tackle redundant genes [33]. While the authors of [33] have
used gene ontology to tackle redundant genes during gene selection,
we have used information gain based gene ranking. There are also
other methods, as in [40], that have attempted classification of
cancer tissue samples with SVMs without feature (gene) selection.
They have reported approximately 85% classification accuracy (about
10-11 falsely classified samples out of nearly 70 in the AML/ALL
leukemia cancer case). In [2], the authors have used Ant Colony
Optimization (ACO) for gene selection with Ant Miner (AM) and RF for
classification.

The parameters used in our algorithms and their corresponding tuned
values are listed in Table II. These values were observed to give the
best results over extensive simulations.

Results: Table III lists the sizes of the gene subsets selected by
BBO, run separately with the SVM and RF algorithms, and the 10-fold
cross validation accuracies obtained for the selected gene subsets.

With reference to the literature, BBO-SVM and BBO-RF have fared well
compared to the previously best performing algorithms (for the same
colon cancer dataset), namely SVMRFE-RG [20], Fisher-RG-SVMRFE [33],
ACO-AM (Ant Colony Optimization-Ant Miner) and ACO-RF [2], which had
demonstrated accuracies of 93.3, 94.7, 95.47 and 96.77% respectively
[34], [35]. Similarly, the best performing algorithms



TABLE II
Tuned algorithm parameters

a. For BBO with both SVM and RF

Parameter                                                     Values
Population size (#candidate solutions for each generation)    50
#Generations                                                  25
Mutation probability                                          0.70
Habitat modification probability                              1.00
Exploitation probability during mutation (for heuristics)     0.55

b. For SVM

Parameter                                      Values
Cost                                           50
Gamma (for Radial Basis Function as kernel)    0.02
Folds for cross-validation                     10

c. For RF

Parameter              Values
Trees in the forest    500
Features per tree      sqrt(#features selected and fed to RF)



TABLE III
Subset sizes and best 10-fold cross validation accuracies (CVAs) in %

D   Original   #genes selected   10-fold CVA    #genes selected   10-fold CVA
    #genes     BBO-SVM           for BBO-SVM    BBO-RF            for BBO-RF
C   2000       9                 98.39          11                92.34
B   7129       15                99.56          20                94.38
L   7129       19                99.60          20                93.20



for leukemia cancer classification have shown accuracies in the range
91-97%, with the best being 97.06% [2], [20], [33], [36], [37]. While
[2], [20] and [33] have worked with the same dataset for AML/ALL
classification, [36] and [37] have worked with a different dataset
(for Diffuse Large B-Cell Lymphoma (DLBCL)) but with similar
properties, which makes us believe that our proposed algorithm with
gene selection will perform equally well there as in AML/ALL
classification. In case of breast cancer, the reported accuracies,
for the dataset we have worked with, have been in the range
of 9 1-94% 081 QUI . Very clearly, our method of using BBO 
for gene selection in combination with SVM and RF has outperformed
them, with accuracies as shown in Table III. It is worth noting that
the methods in the literature have almost always reported their best
accuracies. In our work, we have reported the average accuracies for
both BBO-SVM and BBO-RF.

IV. Conclusion 

The hybrid BBO-SVM and BBO-RF techniques have shown consistently good
results when compared against the highest reported accuracies for the
colon cancer, breast cancer and leukemia datasets. Like other
evolutionary algorithms, they are simple to implement, robust and
flexible, since various alternatives can be adopted to suit the
problem and domain constraints. A significant speedup may be achieved
by parallel implementations in which the classification accuracies of
individual candidate solutions are computed in parallel.



V. Future Work

Like all EAs, our hybrid methods open up many possibilities for
future work, some problem dependent and some from the implementation
perspective. From problem representation to specific migration and
mutation strategies, a variety of schemes are possible. For example,
with respect to problem representation, another suitable scheme that
one could explore for BBO here is the following: each habitat of the
ecosystem could have an arbitrary number of attributes (SIVs), which
remains fixed for that habitat across generations but may differ from
other habitats in the ecosystem. This is similar to the variable
population sizes proposed earlier for BBO and other EAs [39], but at
a finer granularity: variable solution (habitat or chromosome, as the
case may be) sizes. This could lead to a compact implementation
framework that simultaneously outputs the better performing subset
sizes along with the selected genes. Many such variations possible
with other EAs could work here too.